For most people, philosophy may be obscure and confusing. It usually takes a lot of time to understand how the different schools of philosophy relate to one another. In this project, we hand this problem over to the computer and approach it with NLP and machine learning. First, exploratory data analysis was conducted to visualize the dataset and surface some of its patterns. Then the schools of philosophy were clustered on TF-IDF-weighted token counts to see how similar they are to each other. In the last step, a text classifier was trained to predict the school a sentence belongs to from the sentence alone.
# Packages
import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
import seaborn as sns
import dataframe_image as dfi
from wordcloud import WordCloud, STOPWORDS
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import random
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import dendrogram,linkage
# Read data
df = pd.read_csv("../Project 1/philosophy_data.csv")
# Structure of data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 360808 entries, 0 to 360807
Data columns (total 11 columns):
 #   Column                     Non-Null Count   Dtype
---  ------                     --------------   -----
 0   title                      360808 non-null  object
 1   author                     360808 non-null  object
 2   school                     360808 non-null  object
 3   sentence_spacy             360808 non-null  object
 4   sentence_str               360808 non-null  object
 5   original_publication_date  360808 non-null  int64
 6   corpus_edition_date        360808 non-null  int64
 7   sentence_length            360808 non-null  int64
 8   sentence_lowered           360808 non-null  object
 9   tokenized_txt              360808 non-null  object
 10  lemmatized_str             360808 non-null  object
dtypes: int64(3), object(8)
memory usage: 30.3+ MB
# Preview
df.head(10)
| | title | author | school | sentence_spacy | sentence_str | original_publication_date | corpus_edition_date | sentence_length | sentence_lowered | tokenized_txt | lemmatized_str |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Plato - Complete Works | Plato | plato | What's new, Socrates, to make you leave your ... | What's new, Socrates, to make you leave your ... | -350 | 1997 | 125 | what's new, socrates, to make you leave your ... | ['what', 'new', 'socrates', 'to', 'make', 'you... | what be new , Socrates , to make -PRON- lea... |
| 1 | Plato - Complete Works | Plato | plato | Surely you are not prosecuting anyone before t... | Surely you are not prosecuting anyone before t... | -350 | 1997 | 69 | surely you are not prosecuting anyone before t... | ['surely', 'you', 'are', 'not', 'prosecuting',... | surely -PRON- be not prosecute anyone before ... |
| 2 | Plato - Complete Works | Plato | plato | The Athenians do not call this a prosecution b... | The Athenians do not call this a prosecution b... | -350 | 1997 | 74 | the athenians do not call this a prosecution b... | ['the', 'athenians', 'do', 'not', 'call', 'thi... | the Athenians do not call this a prosecution ... |
| 3 | Plato - Complete Works | Plato | plato | What is this you say? | What is this you say? | -350 | 1997 | 21 | what is this you say? | ['what', 'is', 'this', 'you', 'say'] | what be this -PRON- say ? |
| 4 | Plato - Complete Works | Plato | plato | Someone must have indicted you, for you are no... | Someone must have indicted you, for you are no... | -350 | 1997 | 101 | someone must have indicted you, for you are no... | ['someone', 'must', 'have', 'indicted', 'you',... | someone must have indict -PRON- , for -PRON- ... |
| 5 | Plato - Complete Works | Plato | plato | But someone else has indicted you? | But someone else has indicted you? | -350 | 1997 | 34 | but someone else has indicted you? | ['but', 'someone', 'else', 'has', 'indicted', ... | but someone else have indict -PRON- ? |
| 6 | Plato - Complete Works | Plato | plato | I do not really know him myself, Euthyphro. | I do not really know him myself, Euthyphro. | -350 | 1997 | 43 | i do not really know him myself, euthyphro. | ['do', 'not', 'really', 'know', 'him', 'myself... | -PRON- do not really know -PRON- -PRON- , Eut... |
| 7 | Plato - Complete Works | Plato | plato | He is apparently young and unknown. | He is apparently young and unknown. | -350 | 1997 | 35 | he is apparently young and unknown. | ['he', 'is', 'apparently', 'young', 'and', 'un... | -PRON- be apparently young and unknown . |
| 8 | Plato - Complete Works | Plato | plato | They call him Meletus, I believe. | They call him Meletus, I believe. | -350 | 1997 | 33 | they call him meletus, i believe. | ['they', 'call', 'him', 'meletus', 'believe'] | -PRON- call -PRON- Meletus , -PRON- believe . |
| 9 | Plato - Complete Works | Plato | plato | He belongs to the Pitthean deme, if you know a... | He belongs to the Pitthean deme, if you know a... | -350 | 1997 | 147 | he belongs to the pitthean deme, if you know a... | ['he', 'belongs', 'to', 'the', 'pitthean', 'de... | -PRON- belong to the Pitthean deme , if -PRON... |
# Check whether this dataset contains duplicate rows
df.duplicated().sum()
0
# Count the distinct titles in this dataset and plot their distribution.
print(f'There are {df.title.nunique()} distinct titles in total.\nThe distribution of titles is plotted below:')
df.title.value_counts().plot.bar(figsize = (15,5), title = 'titles');
There are 59 distinct titles in total.
The distribution of titles is plotted below:
The plot shows that 'Aristotle - Complete Works' and 'Plato - Complete Works' contribute the most sentences.
# Count the distinct authors in this dataset and plot their distribution.
print(f'There are {df.author.nunique()} distinct authors in total.\nThe distribution of authors is plotted below:')
df.author.value_counts().plot.bar(figsize = (15,5), title = 'authors');
There are 36 distinct authors in total.
The distribution of authors is plotted below:
As in the previous plot, 'Aristotle' and 'Plato' contribute the most sentences among all authors.
# Count the distinct schools in this dataset and plot their distribution.
print(f'There are {df.school.nunique()} distinct schools in total.\nThe distribution of schools is plotted below:')
df.school.value_counts().plot.bar(figsize = (15,5), title = 'schools');
There are 13 distinct schools in total.
The distribution of schools is plotted below:
This diagram shows that 'stoicism' has the fewest sentences.
# Check the distribution of sentence length
print(df.sentence_length.describe())
df.sentence_length.plot.hist(bins = 100, figsize = (15,5), title = 'sentence length');
count    360808.000000
mean        150.790964
std         104.822072
min          20.000000
25%          75.000000
50%         127.000000
75%         199.000000
max        2649.000000
Name: sentence_length, dtype: float64
The table and histogram show the distribution of sentence length, measured in characters. Most sentences are shorter than 500 characters, with a minimum of 20 and a maximum of 2,649. The mean and standard deviation are about 151 and 105, respectively.
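The summary statistics can be reproduced by hand on a small sample. The sketch below uses only the `sentence_length` values of the ten rows shown in the preview table, so its numbers describe that sample and not the full dataset; the threshold of 100 is an arbitrary choice for illustration. Note that `describe()` reports the sample standard deviation (dividing by n - 1), which is also what `statistics.stdev` computes.

```python
import statistics

# sentence_length values of the first ten rows shown in the preview table
preview_lengths = [125, 69, 74, 21, 101, 34, 43, 35, 33, 147]

mean_length = statistics.mean(preview_lengths)
# Sample standard deviation (divides by n - 1), matching pandas' default
std_length = statistics.stdev(preview_lengths)

# Share of sentences shorter than a threshold (100 is an arbitrary value for this toy sample)
threshold = 100
share_below = sum(length < threshold for length in preview_lengths) / len(preview_lengths)

print(f"mean={mean_length:.1f} std={std_length:.1f} share<{threshold}={share_below:.0%}")
```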
# Check the distribution of original publication date
df.original_publication_date.plot.hist(bins = 100, figsize = (15,5), title = 'original publication date');
According to the above diagram, most original publication dates fall between 1500 and 2000. A separate group of sentences dates from around 350 BC, which is stored as -350 in the data.
# Plot the distribution of sentence length by title, author, and school.
fig1, axs1 = plt.subplots(3, figsize = (20,40))
fig1.subplots_adjust(hspace=0.8)
df.boxplot(column = 'sentence_length', by = 'title', rot = 90, ax = axs1[0]);
df.boxplot(column = 'sentence_length', by = 'author', rot = 90, ax = axs1[1]);
df.boxplot(column = 'sentence_length', by = 'school', rot = 90, ax = axs1[2]);
axs1[0].set_title('The distribution of sentence length by title', fontsize = 15, y = 1.05)
axs1[1].set_title('The distribution of sentence length by author', fontsize = 15, y = 1.05)
axs1[2].set_title('The distribution of sentence length by school', fontsize = 15, y = 1.05)
axs1[0].set_ylabel('sentence length')
axs1[1].set_ylabel('sentence length')
axs1[2].set_ylabel('sentence length')
fig1.suptitle('');
According to the three boxplots above, sentences in 'Discourse On Method' and 'Second Treatise On Government' tend to be longer. All the other titles, authors, and schools tend to use sentences of similar length.
# Plot the distribution of original publication date by school to see in which century it is.
df.boxplot(column = 'original_publication_date', by = 'school', rot = 90, figsize = (20, 10));
plt.suptitle('The distribution of original publication date by school', fontsize = 15, y = 0.97);
plt.title('');
plt.ylabel('original publication date');
This boxplot shows that 'aristotle', 'plato', and 'stoicism' are much older than the rest. All the other schools fall within a similar period.
# Check the mean original publication date and mean sentence length grouped by school, author, and title.
table = df.groupby(by = ['author', 'title'], as_index = False)[['original_publication_date', 'sentence_length']].mean().set_index('author')\
.join(df.groupby(by = ['school', 'author'], as_index = False)\
[['original_publication_date', 'sentence_length']].mean().set_index('author'),\
how = 'left',
lsuffix = '_by_title').reset_index().set_index('school')\
.join(df.groupby(by = 'school')[['original_publication_date', 'sentence_length']].mean(),
how = 'left',
lsuffix = '_by_author',
rsuffix = '_by_school').reset_index().round({'original_publication_date_by_school': 0,
'original_publication_date_by_author': 0,
'original_publication_date_by_title': 0,
'sentence_length_by_school': 2,
'sentence_length_by_author': 2,
'sentence_length_by_title': 2})\
.astype({'original_publication_date_by_school': 'int32',
'original_publication_date_by_author': 'int32',
'original_publication_date_by_title': 'int32'})\
.set_index(['school',
'original_publication_date_by_school',
'sentence_length_by_school',
'author',
'original_publication_date_by_author',
'sentence_length_by_author',
'title'])
dfi.export(table,"mytable.png")
table
| school | original_publication_date_by_school | sentence_length_by_school | author | original_publication_date_by_author | sentence_length_by_author | title | original_publication_date_by_title | sentence_length_by_title |
|---|---|---|---|---|---|---|---|---|
| analytic | 1959 | 119.03 | Kripke | 1974 | 119.03 | Naming And Necessity | 1972 | 120.57 |
| | | | | | | Philosophical Troubles | 1975 | 118.60 |
| | | | Lewis | 1985 | 109.72 | Lewis - Papers | 1985 | 109.72 |
| | | | Moore | 1910 | 167.25 | Philosophical Studies | 1910 | 167.25 |
| | | | Popper | 1959 | 139.55 | The Logic Of Scientific Discovery | 1959 | 139.55 |
| | | | Quine | 1950 | 121.64 | Quintessence | 1950 | 121.64 |
| | | | Russell | 1918 | 146.30 | The Analysis Of Mind | 1921 | 142.64 |
| | | | | | | The Problems Of Philosophy | 1912 | 154.54 |
| | | | Wittgenstein | 1948 | 84.88 | On Certainty | 1950 | 79.38 |
| | | | | | | Philosophical Investigations | 1953 | 83.58 |
| | | | | | | Tractatus Logico-Philosophicus | 1921 | 100.19 |
| aristotle | -320 | 153.22 | Aristotle | -320 | 153.22 | Aristotle - Complete Works | -320 | 153.22 |
| capitalism | 1813 | 187.58 | Keynes | 1936 | 196.65 | A General Theory Of Employment, Interest, And Money | 1936 | 196.65 |
| | | | Ricardo | 1817 | 186.25 | On The Principles Of Political Economy And Taxation | 1817 | 186.25 |
| | | | Smith | 1776 | 185.28 | The Wealth Of Nations | 1776 | 185.28 |
| communism | 1877 | 152.75 | Lenin | 1862 | 181.42 | Essential Works Of Lenin | 1862 | 181.42 |
| | | | Marx | 1882 | 143.25 | Capital | 1883 | 142.97 |
| | | | | | | The Communist Manifesto | 1848 | 150.68 |
| continental | 1966 | 171.79 | Deleuze | 1970 | 163.67 | Anti-Oedipus | 1972 | 165.51 |
| | | | | | | Difference And Repetition | 1968 | 161.58 |
| | | | Derrida | 1967 | 143.43 | Writing And Difference | 1967 | 143.43 |
| | | | Foucault | 1963 | 189.64 | History Of Madness | 1961 | 174.42 |
| | | | | | | The Birth Of The Clinic | 1963 | 184.99 |
| | | | | | | The Order Of Things | 1966 | 218.20 |
| empiricism | 1716 | 183.64 | Berkeley | 1712 | 139.65 | A Treatise Concerning The Principles Of Human Knowledge | 1710 | 184.72 |
| | | | | | | Three Dialogues | 1713 | 111.98 |
| | | | Hume | 1745 | 180.19 | A Treatise Of Human Nature | 1739 | 183.01 |
| | | | | | | Dialogues Concerning Natural Religion | 1779 | 164.51 |
| | | | Locke | 1689 | 200.40 | Essay Concerning Human Understanding | 1689 | 190.59 |
| | | | | | | Second Treatise On Government | 1689 | 266.79 |
| feminism | 1933 | 153.08 | Beauvoir | 1949 | 148.79 | The Second Sex | 1949 | 148.79 |
| | | | Davis | 1981 | 139.67 | Women, Race, And Class | 1981 | 139.67 |
| | | | Wollstonecraft | 1792 | 190.96 | Vindication Of The Rights Of Woman | 1792 | 190.96 |
| german_idealism | 1803 | 180.25 | Fichte | 1798 | 151.96 | The System Of Ethics | 1798 | 151.96 |
| | | | Hegel | 1815 | 175.72 | Elements Of The Philosophy Of Right | 1820 | 161.01 |
| | | | | | | Science Of Logic | 1817 | 187.17 |
| | | | | | | The Phenomenology Of Spirit | 1807 | 168.70 |
| | | | Kant | 1785 | 198.16 | Critique Of Judgement | 1790 | 211.98 |
| | | | | | | Critique Of Practical Reason | 1788 | 175.38 |
| | | | | | | Critique Of Pure Reason | 1781 | 197.86 |
| nietzsche | 1887 | 116.60 | Nietzsche | 1887 | 116.60 | Beyond Good And Evil | 1886 | 188.08 |
| | | | | | | Ecce Homo | 1888 | 133.98 |
| | | | | | | The Antichrist | 1888 | 133.34 |
| | | | | | | Thus Spake Zarathustra | 1887 | 80.56 |
| | | | | | | Twilight Of The Idols | 1888 | 126.84 |
| phenomenology | 1938 | 145.91 | Heidegger | 1937 | 118.54 | Being And Time | 1927 | 126.47 |
| | | | | | | Off The Beaten Track | 1950 | 108.53 |
| | | | Husserl | 1931 | 185.47 | The Crisis Of The European Sciences And Phenomenology | 1936 | 192.05 |
| | | | | | | The Idea Of Phenomenology | 1907 | 150.56 |
| | | | Merleau-Ponty | 1945 | 170.93 | The Phenomenology Of Perception | 1945 | 170.93 |
| plato | -350 | 114.94 | Plato | -350 | 114.94 | Plato - Complete Works | -350 | 114.94 |
| rationalism | 1681 | 163.96 | Descartes | 1640 | 247.38 | Discourse On Method | 1637 | 375.60 |
| | | | | | | Meditations On First Philosophy | 1641 | 192.34 |
| | | | Leibniz | 1710 | 157.09 | Theodicy | 1710 | 157.09 |
| | | | Malebranche | 1674 | 164.43 | The Search After Truth | 1674 | 164.43 |
| | | | Spinoza | 1677 | 146.54 | Ethics | 1677 | 142.07 |
| | | | | | | On The Improvement Of Understanding | 1677 | 176.80 |
| stoicism | 164 | 137.06 | Epictetus | 125 | 118.43 | Enchiridion | 125 | 118.43 |
| | | | Marcus Aurelius | 170 | 139.78 | Meditations | 170 | 139.78 |
This table lists the mean sentence length and mean original publication date for each school, author, and title, giving exact figures for the patterns seen in the boxplots above.
# Plot the wordclouds of each school
stopwords = set(STOPWORDS)
fig2, axs2 = plt.subplots(7,2, figsize = (20, 100))
fig2.tight_layout()
axs2 = axs2.ravel()
fig2.delaxes(axs2[13])
school_list = list(df.school.unique())
for idx, school in enumerate(school_list):
    # Join with a space so that words at sentence boundaries are not fused together
    corpus_i = ' '.join(df.loc[df.school == school, 'sentence_str'].tolist())
    wordcloud = WordCloud(width = 800,
                          height = 1000,
                          background_color = 'white',
                          stopwords = stopwords,
                          min_font_size = 10).generate(corpus_i)
    axs2[idx].imshow(wordcloud)
    axs2[idx].set_title('Wordcloud of ' + school, fontsize = 20, y = 1.02)
    axs2[idx].axis('off')
The word clouds of each school of philosophy are plotted above; they show how the dominant vocabulary differs between schools.
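A word cloud is essentially a picture of word frequencies after stopword removal, so the same pattern can be read off as a ranked list. The following is a minimal sketch on hypothetical toy sentences (not taken from the dataset) with a tiny illustrative stopword list, whereas the notebook itself relies on `wordcloud.STOPWORDS`; the helper name `top_words` is also made up for this example.

```python
from collections import Counter
import re

# Hypothetical toy data standing in for (school, sentence) pairs from the dataset
toy_rows = [
    ("plato", "The soul is immortal and the soul is just."),
    ("plato", "Justice is the virtue of the soul."),
    ("capitalism", "The price of labour depends on the price of corn."),
    ("capitalism", "Capital employs labour and labour produces value."),
]

# Tiny illustrative stopword list (an assumption, much shorter than wordcloud.STOPWORDS)
toy_stopwords = {"the", "is", "and", "of", "on", "a"}

def top_words(rows, stopwords, k=3):
    """Return the k most frequent non-stopword tokens for each school."""
    counts = {}
    for school, sentence in rows:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        counts.setdefault(school, Counter()).update(
            t for t in tokens if t not in stopwords
        )
    return {school: c.most_common(k) for school, c in counts.items()}

result = top_words(toy_rows, toy_stopwords)
for school, words in result.items():
    print(school, words)
```

Comparing such ranked lists side by side makes it easier to name the words that distinguish two schools than comparing two images.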
# Create a new corpus grouped by schools
corpus_by_school = []
for school in school_list:
    # Join with a space so that words at sentence boundaries are not fused together
    corpus_i = ' '.join(df.loc[df.school == school, 'sentence_str'].tolist())
    corpus_by_school.append(corpus_i)
# Use 'TfidfVectorizer' to convert the new corpus into a TF-IDF matrix (one row per school)
TfidfVec = TfidfVectorizer(lowercase = True,
min_df = 5,
max_df = 0.8,
ngram_range = (1,1))
X_vec = TfidfVec.fit_transform(corpus_by_school)
X_vec_df = pd.DataFrame.sparse.from_spmatrix(X_vec,
index = school_list,
columns = list(dict(sorted(TfidfVec.vocabulary_.items(), key = lambda x: x[1])).keys()))
# Use HAC to cluster all the schools and plot the dendrogram
HAC = linkage(pdist(X_vec_df, metric = 'euclidean'), method = 'complete')
plt.figure(figsize=(25, 15))
dendrogram(HAC,
orientation = 'right',
labels = school_list,
distance_sort = 'descending',
show_leaf_counts = False)
plt.show()
Hierarchical agglomerative clustering (HAC) with complete linkage was conducted in this step. According to the dendrogram, 'stoicism' and 'nietzsche' are the most similar schools, followed by 'communism' and 'capitalism'.
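To make the distance behind the dendrogram concrete, here is a minimal pure-Python sketch of TF-IDF vectors and pairwise Euclidean distances on a hypothetical three-document corpus. It follows the default weighting of scikit-learn's `TfidfVectorizer` (smoothed IDF, L2-normalized rows) but ignores the `min_df` and `max_df` filters used above, and the toy texts and function names are assumptions made for illustration only.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: one short "document" per school
toy_corpus = {
    "communism": "labour capital value wage labour",
    "capitalism": "capital value price labour market",
    "stoicism": "virtue nature soul reason",
}

def tfidf_vectors(corpus):
    """Smoothed TF-IDF with L2-normalized rows (scikit-learn's default weighting)."""
    n_docs = len(corpus)
    term_counts = {name: Counter(text.split()) for name, text in corpus.items()}
    vocab = sorted({t for c in term_counts.values() for t in c})
    doc_freq = {t: sum(1 for c in term_counts.values() if t in c) for t in vocab}
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = {}
    for name, counts in term_counts.items():
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors[name] = [x / norm for x in raw]
    return vectors

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

vectors = tfidf_vectors(toy_corpus)
distances = {
    (a, b): euclidean(vectors[a], vectors[b])
    for a, b in combinations(sorted(vectors), 2)
}
# The pair with the smallest distance is the first one a dendrogram would merge
closest_pair = min(distances, key=distances.get)
print(closest_pair, round(distances[closest_pair], 3))
```

Because every row has unit length, the squared Euclidean distance equals 2 minus twice the cosine similarity, so two schools with no shared vocabulary sit at the maximum distance of the square root of 2.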
# Define the corpus
corpus = list(df.sentence_str)
# Use 'OrdinalEncoder' to encode schools as integers
encoder = OrdinalEncoder()
Y_2d = encoder.fit_transform(np.array(df.school).reshape(len(df.school),1))
Y = Y_2d.reshape(len(df.school),).astype('int64')
# Train & Test split
X_train, X_test, Y_train, Y_test = train_test_split(corpus, Y)
# Use a pipeline to wrap vectorizer and classifier together
# After fitting the pipeline, calculate the cross validation accuracy and test set accuracy of the model
pipe_sa = Pipeline([('TfidfVec', TfidfVectorizer(lowercase = True,
min_df = 5,
max_df = 0.8,
ngram_range = (1,1))),
('lr', LogisticRegression(C = 1,
penalty = 'l2',
multi_class = 'multinomial',
max_iter = 1000))])
pipe_sa.fit(X_train, Y_train)
cv_score_pipe = cross_val_score(pipe_sa, X_train, Y_train)
print(f'pipe cv accuracy: {cv_score_pipe.mean():0.2f}')
test_score_pipe = pipe_sa.score(X_test, Y_test)
print(f'pipe test set accuracy: {test_score_pipe:0.2f}')
pipe cv accuracy: 0.75
pipe test set accuracy: 0.77
# Visualize confusion matrix
Y_test_pred = pipe_sa.predict(X_test)
# Decode the test labels and the predictions back to school names
Y_test_school = encoder.inverse_transform(Y_test.reshape(-1, 1)).ravel()
Y_test_pred_school = encoder.inverse_transform(Y_test_pred.reshape(-1, 1)).ravel()
# Cross-tabulate predicted school (rows) against true school (columns)
cm = pd.crosstab(Y_test_pred_school,
                 Y_test_school,
                 rownames = ['prediction'],
                 colnames = ['school'])
plt.figure(figsize=(10,8))
sns.heatmap(data = cm, annot=True, fmt='g', cmap='Blues');
# Randomly select some samples, and check whether our model is able to predict the school they belong to correctly
random.seed(123)
sample_list = random.sample(range(len(df)),10)
df.loc[sample_list,['school', 'sentence_str']]
| | school | sentence_str |
|---|---|---|
| 27453 | plato | We're agreed about imitators, then. |
| 140339 | analytic | Even if the words which I say are those which,... |
| 45710 | aristotle | Perhaps the necessary is present also in the d... |
| 213511 | continental | If this constitutes a system of writing, it is... |
| 139750 | analytic | For if we too in these investigations are tryi... |
| 56465 | aristotle | All animals are furnished with fat, either int... |
| 20003 | plato | , that I hold to be worth all the other contes... |
| 198770 | continental | But all these positive elements which constitu... |
| 281124 | german_idealism | heingand its own reality. |
| 294816 | communism | Hence, the Shylock law of the Ten Tables. |
# Create a new test set
X_test_new = df.sentence_str.loc[sample_list].tolist()
Y_test_new = df.school.loc[sample_list].tolist()
Y_predict = pipe_sa.predict(X_test_new) # Use our model to make prediction on the new test set
Y_predict_school = encoder.inverse_transform(Y_predict.reshape(10,1)).reshape(10,) # Decode our prediction to get corresponding school
prediction = pd.DataFrame(data = {'sentence_str': X_test_new,
'school': Y_test_new,
'school_predict': Y_predict_school,
'predict_outcome': Y_test_new == Y_predict_school},
index = sample_list)
prediction
| | sentence_str | school | school_predict | predict_outcome |
|---|---|---|---|---|
| 27453 | We're agreed about imitators, then. | plato | plato | True |
| 140339 | Even if the words which I say are those which,... | analytic | analytic | True |
| 45710 | Perhaps the necessary is present also in the d... | aristotle | aristotle | True |
| 213511 | If this constitutes a system of writing, it is... | continental | continental | True |
| 139750 | For if we too in these investigations are tryi... | analytic | phenomenology | False |
| 56465 | All animals are furnished with fat, either int... | aristotle | aristotle | True |
| 20003 | , that I hold to be worth all the other contes... | plato | plato | True |
| 198770 | But all these positive elements which constitu... | continental | continental | True |
| 281124 | heingand its own reality. | german_idealism | german_idealism | True |
| 294816 | Hence, the Shylock law of the Ten Tables. | communism | german_idealism | False |
In the last step, all sentences were converted into a TF-IDF matrix, on which a logistic regression classifier was fitted. On the test set, the model reaches an accuracy of about 77%. Then 10 sentences were randomly selected from the full dataset as a quick sanity check; because they were not restricted to the test set, some of them may have been seen during training. Among these 10 samples, only two were predicted incorrectly. According to the clustering step, 'analytic' and 'phenomenology' are very similar, so the wrong prediction for sample 139750 is understandable.
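The ten-sample check can be summarized with the same quantities a confusion matrix reports. The sketch below copies the true and predicted schools from the prediction table above and derives the accuracy, the confused pairs, and the per-school recall; with only ten samples these figures are illustrative and say little about the model's behavior on the full test set.

```python
from collections import Counter

# True and predicted schools of the ten sampled sentences, copied from the table above
true_schools = ["plato", "analytic", "aristotle", "continental", "analytic",
                "aristotle", "plato", "continental", "german_idealism", "communism"]
predicted_schools = ["plato", "analytic", "aristotle", "continental", "phenomenology",
                     "aristotle", "plato", "continental", "german_idealism", "german_idealism"]

pairs = list(zip(true_schools, predicted_schools))

# Overall accuracy: share of samples whose prediction matches the true school
accuracy = sum(t == p for t, p in pairs) / len(pairs)

# Off-diagonal cells of the confusion matrix: (true school, predicted school) -> count
confusions = Counter((t, p) for t, p in pairs if t != p)

# Per-school recall: correct predictions divided by the number of true samples of that school
support = Counter(true_schools)
hits = Counter(t for t, p in pairs if t == p)
recall = {school: hits[school] / support[school] for school in support}

print(f"accuracy: {accuracy:.2f}")
print("confused pairs:", dict(confusions))
print("recall:", recall)
```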